{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3-Way Sentiment Analysis for Tweets\n",
    "\n",
    "#### Overview\n",
    "\n",
    "In this project, we'll build a 3-way polarity (positive, negative, neutral) classification system for tweets, without using NLTK's in-built sentiment analysis engine. \n",
    "\n",
    "We'll use a logistic regression classifier, bag-of-words features, and polarity lexicons (both in-built and external). We'll also create our own pre-processing module to handle raw tweets. \n",
    "\n",
    "#### Data Used\n",
    "\n",
    "- training.json: This file contains ~15k raw tweets, along with their polarity labels (1 = positive, 0 = neutral, -1 = negative). We'll use this file to train our classifiers.\n",
    "\n",
    "- develop.json: In the same format as training.json, the file contains a smaller set of tweets. We'll use it to test the predictions of our classifiers which were trained on the training set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Preprocessing\n",
    "\n",
    "The first thing that we'll do is preprocess the tweets so that they're easier to deal with, and ready for feature extraction, and training by the classifiers.\n",
    "\n",
    "To start with we're going to extract the tweets from the json file, read each line and store the tweets, labels in separate lists.\n",
    "\n",
    "Then for the preprocessing, we'll:\n",
    "\n",
    "- __segment__ tweets into sentences using an NTLK segmenter\n",
    "- __tokenize__ the sentences using an NLTK tokenizer\n",
    "- __lowercase__ all the words\n",
    "- __remove twitter usernames__ beginning with @ using regex\n",
    "- __remove URLs__ starting with http using regex\n",
    "- __process hashtags__ ,for this we'll tokenize hashtags,and try to break down multi-word hashtags using a MaxMatch algorithm, and the English word dictionary supplied with NLTK.\n",
    "\n",
    "Let's build some functions to accomplish all this."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import re\n",
    "import nltk"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "[nltk_data] Downloading package words to /Users/sajal/nltk_data...\n",
      "[nltk_data]   Package words is already up-to-date!\n",
      "[nltk_data] Downloading package omw-1.4 to /Users/sajal/nltk_data...\n",
      "[nltk_data]   Package omw-1.4 is already up-to-date!\n",
      "[nltk_data] Downloading package sentiwordnet to\n",
      "[nltk_data]     /Users/sajal/nltk_data...\n",
      "[nltk_data]   Package sentiwordnet is already up-to-date!\n",
      "[nltk_data] Downloading package word2vec_sample to\n",
      "[nltk_data]     /Users/sajal/nltk_data...\n",
      "[nltk_data]   Package word2vec_sample is already up-to-date!\n",
      "[nltk_data] Downloading package opinion_lexicon to\n",
      "[nltk_data]     /Users/sajal/nltk_data...\n",
      "[nltk_data]   Unzipping corpora/opinion_lexicon.zip.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 42,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "nltk.download('words')\n",
    "nltk.download('omw-1.4')\n",
    "nltk.download('sentiwordnet')\n",
    "nltk.download('word2vec_sample')\n",
    "nltk.download('opinion_lexicon')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "lemmatizer = nltk.stem.wordnet.WordNetLemmatizer()\n",
    "dictionary = set(nltk.corpus.words.words()) #To be used for MaxMatch\n",
    "\n",
    "#Function to lemmatize word | Used during maxmatch\n",
    "def lemmatize(word):\n",
    "    lemma = lemmatizer.lemmatize(word,'v')\n",
    "    if lemma == word:\n",
    "        lemma = lemmatizer.lemmatize(word,'n')\n",
    "    return lemma\n",
    "\n",
    "#Function to implement the maxmatch algorithm for multi-word hashtags\n",
    "def maxmatch(word,dictionary):\n",
    "    if not word:\n",
    "        return []\n",
    "    for i in range(len(word),1,-1):\n",
    "        first = word[0:i]\n",
    "        rem = word[i:]\n",
    "        if lemmatize(first).lower() in dictionary: #Important to lowercase lemmatized words before comparing in dictionary. \n",
    "            return [first] + maxmatch(rem,dictionary)\n",
    "    first = word[0:1]\n",
    "    rem = word[1:]\n",
    "    return [first] + maxmatch(rem,dictionary)\n",
    "\n",
    "#Function to preprocess a single tweet\n",
    "def preprocess(tweet):\n",
    "    \n",
    "    tweet = re.sub(\"@\\w+\",\"\",tweet).strip()\n",
    "    tweet = re.sub(\"http\\S+\",\"\",tweet).strip()\n",
    "    hashtags = re.findall(\"#\\w+\",tweet)\n",
    "    \n",
    "    tweet = tweet.lower()\n",
    "    tweet = re.sub(\"#\\w+\",\"\",tweet).strip() \n",
    "    \n",
    "    hashtag_tokens = [] #Separate list for hashtags\n",
    "    \n",
    "    for hashtag in hashtags:\n",
    "        hashtag_tokens.append(maxmatch(hashtag[1:],dictionary))        \n",
    "    \n",
    "    segmenter = nltk.data.load('tokenizers/punkt/english.pickle')\n",
    "    segmented_sentences = segmenter.tokenize(tweet)\n",
    "    \n",
    "    #General tokenization\n",
    "    processed_tweet = []\n",
    "    \n",
    "    word_tokenizer = nltk.tokenize.regexp.WordPunctTokenizer()\n",
    "    for sentence in segmented_sentences:\n",
    "        tokenized_sentence = word_tokenizer.tokenize(sentence.strip())\n",
    "        processed_tweet.append(tokenized_sentence)\n",
    "    \n",
    "    #Processing the hashtags only when they exist in a tweet\n",
    "    if hashtag_tokens:\n",
    "        for tag_token in hashtag_tokens:\n",
    "            processed_tweet.append(tag_token)\n",
    "    \n",
    "    return processed_tweet\n",
    "    \n",
    "#Custom function that takes in a file, and passes each tweet to the preprocessor\n",
    "def preprocess_file(filename):\n",
    "    tweets = []\n",
    "    labels = []\n",
    "    f = open(filename)\n",
    "    for line in f:\n",
    "        tweet_dict = json.loads(line)\n",
    "        tweets.append(preprocess(tweet_dict[\"text\"]))\n",
    "        labels.append(int(tweet_dict[\"label\"]))\n",
    "    return tweets,labels"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Before we run preprocess our training data, let's see how well the maxmatch algorithm works."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['we', 'can']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "maxmatch('wecan',dictionary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's try feeding it something harder than that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['cases', 'tu', 'd', 'y']"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "maxmatch('casestudy',dictionary)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see from the above example, it incorrectly breks down the word 'casestudy', by returning 'cases', instead of 'case' for the first iteration., which would have been a better output. This is because it _greedily_ extract 'cases' first.\n",
    "\n",
    "For an improvement, we can implement an algorithm that better counts the total number of successful matches in the result of the maxmatch process, and return the one with the highest successful match count.\n",
    "\n",
    "Let's run our preprocessing module on the raw training data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Running the basic preprocessing module and capturing the data (maybe shift to the next block)\n",
    "train_data = preprocess_file('data/sentiment/training.json')\n",
    "train_tweets = train_data[0]\n",
    "train_labels = train_data[1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's print out the first couple processed tweets:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[['dear', 'the', 'newooffice', 'for', 'mac', 'is', 'great', 'and', 'all', ',', 'but', 'no', 'lync', 'update', '?'], ['c', \"'\", 'mon', '.']], [['how', 'about', 'you', 'make', 'a', 'system', 'that', 'doesn', \"'\", 't', 'eat', 'my', 'friggin', 'discs', '.'], ['this', 'is', 'the', '2nd', 'time', 'this', 'has', 'happened', 'and', 'i', 'am', 'so', 'sick', 'of', 'it', '!']]]\n"
     ]
    }
   ],
   "source": [
    "print(train_tweets[:2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hmm.. we can do better than that to make sense of what's happening. Let's write a simple script to that'll run the preprocessing module on a few tweets, and print the original and processed results, side by side; if it detects a multi-word hashtag."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1. Original Tweet: If I make a game as a #windows10 Universal App. Will #xboxone owners be able to download and play it in November? @majornelson @Microsoft\n",
      "Processed tweet: [['if', 'i', 'make', 'a', 'game', 'as', 'a', 'universal', 'app', '.'], ['will', 'owners', 'be', 'able', 'to', 'download', 'and', 'play', 'it', 'in', 'november', '?'], ['windows', '1', '0'], ['x', 'box', 'one']]\n",
      "\n",
      "2. Original Tweet: Microsoft, I may not prefer your gaming branch of business. But, you do make a damn fine operating system. #Windows10 @Microsoft\n",
      "Processed tweet: [['microsoft', ',', 'i', 'may', 'not', 'prefer', 'your', 'gaming', 'branch', 'of', 'business', '.'], ['but', ',', 'you', 'do', 'make', 'a', 'damn', 'fine', 'operating', 'system', '.'], ['Window', 's', '1', '0']]\n",
      "\n",
      "3. Original Tweet: @MikeWolf1980 @Microsoft I will be downgrading and let #Windows10 be out for almost the 1st yr b4 trying it again. #Windows10fail\n",
      "Processed tweet: [['i', 'will', 'be', 'downgrading', 'and', 'let', 'be', 'out', 'for', 'almost', 'the', '1st', 'yr', 'b4', 'trying', 'it', 'again', '.'], ['Window', 's', '1', '0'], ['Window', 's', '1', '0', 'fail']]\n",
      "\n",
      "4. Original Tweet: @Microsoft 2nd computer with same error!!! #Windows10fail Guess we will shelve this until SP1! http://t.co/QCcHlKuy8Q\n",
      "Processed tweet: [['2nd', 'computer', 'with', 'same', 'error', '!!!'], ['guess', 'we', 'will', 'shelve', 'this', 'until', 'sp1', '!'], ['Window', 's', '1', '0', 'fail']]\n",
      "\n",
      "5. Original Tweet: Sunday morning, quiet day so time to welcome in #Windows10 @Microsoft @Windows http://t.co/7VtvAzhWmV\n",
      "Processed tweet: [['sunday', 'morning', ',', 'quiet', 'day', 'so', 'time', 'to', 'welcome', 'in'], ['Window', 's', '1', '0']]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "#Printing examples of multi-word hashtags (Doesn't work for multi sentence tweets)\n",
    "f = open('data/sentiment/training.json')\n",
    "count = 1\n",
    "for index,line in enumerate(f):\n",
    "    if count >5:\n",
    "        break\n",
    "    original_tweet = json.loads(line)[\"text\"]\n",
    "    hashtags = re.findall(\"#\\w+\",original_tweet)\n",
    "    if hashtags:\n",
    "        for hashtag in hashtags:\n",
    "            if len(maxmatch(hashtag[1:],dictionary)) > 1:\n",
    "                #If the length of the array returned by the maxmatch function is greater than 1,\n",
    "                #it means that the algorithm has detected a hashtag with more than 1 word inside. \n",
    "                print(str(count) + \". Original Tweet: \" + original_tweet + \"\\nProcessed tweet: \" + str(train_tweets[index]) + \"\\n\")\n",
    "                count += 1\n",
    "                break"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's better! Our pre-processing module is working as intended.\n",
    "\n",
    "The next step is to convert each processed tweet into a bag-of-words feature dictionary. We'll allow for options to remove stopwords during the process, and also to remove _rare_ words, i.e. words occuring less than n times across the whole training set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "from nltk.corpus import stopwords\n",
    "\n",
    "stopwords = set(stopwords.words('english'))\n",
    "\n",
    "#To identify words appearing less than n times, we're creating a dictionary for the whole training set\n",
    "\n",
    "total_train_bow = {}\n",
    "\n",
    "for tweet in train_tweets:\n",
    "    for segment in tweet:\n",
    "        for token in segment:\n",
    "            total_train_bow[token] = total_train_bow.get(token,0) + 1\n",
    "\n",
    "#Function to convert pre_processed tweets to bag of words feature dictionaries\n",
    "#Allows for options to remove stopwords, and also to remove words occuring less than n times in the whole training set.            \n",
    "def convert_to_feature_dicts(tweets,remove_stop_words,n): \n",
    "    feature_dicts = []\n",
    "    for tweet in tweets:\n",
    "        # build feature dictionary for tweet\n",
    "        feature_dict = {}\n",
    "        if remove_stop_words:\n",
    "            for segment in tweet:\n",
    "                for token in segment:\n",
    "                    if token not in stopwords and (n<=0 or total_train_bow[token]>=n):\n",
    "                        feature_dict[token] = feature_dict.get(token,0) + 1\n",
    "        else:\n",
    "            for segment in tweet:\n",
    "                for token in segment:\n",
    "                    if n<=0 or total_train_bow[token]>=n:\n",
    "                        feature_dict[token] = feature_dict.get(token,0) + 1\n",
    "        feature_dicts.append(feature_dict)\n",
    "    return feature_dicts"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now that we have our function to convert raw tweets to feature dictionaries, let's run it on our training and development data. We'll also convert the feature dictionary to a [sparse representation](http://scikit-learn.org/stable/modules/feature_extraction.html), so that it can be used by scikit's ML algorithms."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.feature_extraction import DictVectorizer\n",
    "vectorizer = DictVectorizer()\n",
    "\n",
    "#Conversion to feature dictionaries\n",
    "train_set = convert_to_feature_dicts(train_tweets,True,2)\n",
    "\n",
    "dev_data = preprocess_file('data/sentiment/develop.json')\n",
    "\n",
    "dev_set = convert_to_feature_dicts(dev_data[0],False,0)\n",
    "\n",
    "#Conversion to sparse representations\n",
    "training_data = vectorizer.fit_transform(train_set)\n",
    "\n",
    "development_data = vectorizer.transform(dev_set)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Classifying\n",
    "\n",
    "Now, we'll run our data through a decision tree classifier, and try to tune the parameters by using Grid Search over parameter combinations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Optimal parameters for DT: {'criterion': 'gini', 'max_features': None, 'min_samples_leaf': 75}\n",
      "\n",
      "Decision Tree Accuracy: 0.48004373974849646\n"
     ]
    }
   ],
   "source": [
    "from sklearn.tree import DecisionTreeClassifier\n",
    "from sklearn.metrics import accuracy_score, classification_report\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "\n",
    "#Grid used to test the combinations of parameters\n",
    "tree_param_grid = [\n",
    "    {'criterion':['gini','entropy'], 'min_samples_leaf': [75,100,125,150,175], 'max_features':['sqrt','log2',None],\n",
    "    }\n",
    "]\n",
    "\n",
    "tree_clf = GridSearchCV(DecisionTreeClassifier(),tree_param_grid,cv=10,scoring='accuracy')\n",
    "\n",
    "tree_clf.fit(training_data,train_data[1])\n",
    "\n",
    "print(\"Optimal parameters for DT: \" + str(tree_clf.best_params_)) #To print out the best discovered combination of the parameters\n",
    "\n",
    "tree_predictions = tree_clf.predict(development_data)\n",
    "\n",
    "print(\"\\nDecision Tree Accuracy: \" + str(accuracy_score(dev_data[1],tree_predictions)))\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The decision tree classifier doesn't seem to work very well, but we still don't have a benchmark to compare it with. \n",
    "\n",
    "Let's run our data through a dummy classifier which'll pick the most frequently occuring class as the output, each time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "Most common class baseline accuracy: 0.42044833242208857\n"
     ]
    }
   ],
   "source": [
    "from sklearn.dummy import DummyClassifier\n",
    "\n",
    "#The dummy classifier below always predicts the most frequent class, as specified in the strategy. \n",
    "dummy_clf = DummyClassifier(strategy='most_frequent')\n",
    "dummy_clf.fit(development_data,dev_data[1])\n",
    "dummy_predictions = dummy_clf.predict(development_data)\n",
    "\n",
    "print(\"\\nMost common class baseline accuracy: \" + str(accuracy_score(dev_data[1],dummy_predictions)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can see that out DT classifier at least performs better than the dummy classifier.\n",
    "\n",
    "We'll do the same process for logisitc regression classifier now. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Optimal parameters for LR: {'C': 0.012, 'multi_class': 'multinomial', 'solver': 'lbfgs'}\n",
      "Logistic Regression Accuracy: 0.4931656642974303\n"
     ]
    }
   ],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "\n",
    "log_param_grid = [\n",
    "    {'C':[0.012,0.0125,0.130,0.135,0.14],\n",
    "     'solver':['lbfgs'],'multi_class':['multinomial']\n",
    "    }\n",
    "]\n",
    "\n",
    "log_clf = GridSearchCV(LogisticRegression(max_iter=400),log_param_grid,cv=10,scoring='accuracy')\n",
    "\n",
    "log_clf.fit(training_data,train_data[1])\n",
    "\n",
    "log_predictions = log_clf.predict(development_data)\n",
    "\n",
    "print(\"Optimal parameters for LR: \" + str(log_clf.best_params_))\n",
    "\n",
    "print(\"Logistic Regression Accuracy: \" + str(accuracy_score(dev_data[1],log_predictions)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To recap what just happened, we created a logistic regression classifier by doing a grid search for the best parameters for C (regularization parameter), solver type, and multi_class handling, just like we did for the decision tree classifier. \n",
    "\n",
    "We also created a dummy classifier that just picks the most common class in the development set for each prediction.\n",
    "\n",
    "The table below describes the different classifiers and their accuracy scores.\n",
    "\n",
    "<table>\n",
    "<tr>\n",
    "<th>Classifier</th>\n",
    "<th>Approx. Accuracy score (in %)</th>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>Dummy classifier (most common class)</td>\n",
    "<td>42</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>Decision Tree classifier</td>\n",
    "<td>48.7</td>\n",
    "</tr>\n",
    "<tr>\n",
    "<td>Logistic Regression classifier</td>\n",
    "<td>49.3</td>\n",
    "</tr>\n",
    "</table>\n",
    "\n",
    "As we can see, both classifiers are better than the 'dummy' classifier which just picks the most common class all the time."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Polarity Lexicons\n",
    "\n",
    "Now, we'll try to integrate external information into the training set, in the form polarity scores for the tweets. \n",
    "\n",
    "We'll build two automatic lexicons, compare it with NLTK's manually annotated set, and then add that information to our training data.\n",
    "\n",
    "The __first lexicon__ will be built through SentiWordNet. This has pre-calculated scores positive, negative and neutral sentiments for some words in WordNet. As this information is arranged in the form of synsets, we'll just take the most common polarity across its senses (and take neutral in case of a tie). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Positive words: ['toffee-nosed', 'auspiciousness', 'appetent', 'carminative', 'identity']\n",
      "Negative Words: ['metrorrhagia', 'no-frills', 'overawed', 'offensive', 'contempt']\n"
     ]
    }
   ],
   "source": [
    "from nltk.corpus import sentiwordnet as swn\n",
    "from nltk.corpus import wordnet as wn\n",
    "import random\n",
    "\n",
    "swn_positive = []\n",
    "\n",
    "swn_negative = []\n",
    "\n",
    "#Function supplied with the assignment, not described below.\n",
    "def get_polarity_type(synset_name):\n",
    "    swn_synset =  swn.senti_synset(synset_name)\n",
    "    if not swn_synset:\n",
    "        return None\n",
    "    elif swn_synset.pos_score() > swn_synset.neg_score() and swn_synset.pos_score() > swn_synset.obj_score():\n",
    "        return 1\n",
    "    elif swn_synset.neg_score() > swn_synset.pos_score() and swn_synset.neg_score() > swn_synset.obj_score():\n",
    "        return -1\n",
    "    else:\n",
    "        return 0\n",
    "\n",
    "\n",
    "for synset in wn.all_synsets():      \n",
    "    \n",
    "    # count synset polarity for each lemma\n",
    "    pos_count = 0\n",
    "    neg_count = 0\n",
    "    neutral_count = 0\n",
    "    \n",
    "    for lemma in synset.lemma_names():\n",
    "        for syns in wn.synsets(lemma):\n",
    "            if get_polarity_type(syns.name())==1:\n",
    "                pos_count+=1\n",
    "            elif get_polarity_type(syns.name())==-1:\n",
    "                neg_count+=1\n",
    "            else:\n",
    "                neutral_count+=1\n",
    "    \n",
    "    if pos_count > neg_count and pos_count >= neutral_count: #>=neutral as words that are more positive than negative, \n",
    "                                                                #despite being equally neutral might belong to positive list (explain)\n",
    "        swn_positive.append(synset.lemma_names()[0])\n",
    "    elif neg_count > pos_count and neg_count >= neutral_count:\n",
    "        swn_negative.append(synset.lemma_names()[0])       \n",
    "\n",
    "swn_positive = list(set(swn_positive))\n",
    "swn_negative = list(set(swn_negative))\n",
    "            \n",
    "            \n",
    "print('Positive words: ' + str(random.sample(swn_positive,5)))\n",
    "\n",
    "print('Negative Words: ' + str(random.sample(swn_negative,5)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "I'll try and explain what happened. \n",
    "\n",
    "To calculate the polarity of a synset across its senses, the lemma names were extracted from the synset to get its 'senses'. Then, each of those lemma names were converted to a synset object, which was then passed to the pre-supplied 'get_polarity_type' function. Based on the score returned, the head lemma of the synset object was appended to the relevant list. The head lemma was chosen from the lemma_names, as it best represents the synset object.\n",
    "\n",
    "As the code above returns a random sample of positive and negative words each time, the words returned when I ran the code the first time (different from the above) were:\n",
    "\n",
    "Positive words: [u'counterblast', u'unperceptiveness', u'eater', u'white_magic', u'cuckoo-bumblebee']\n",
    "Negative Words: [u'sun_spurge', u'pinkness', u'hardness', u'unready', u'occlusive']\n",
    "\n",
    "At first glance, they seem like a better than average sample of negative words, and a worse than average sample of positive ones.\n",
    "\n",
    "This might be due to the fact that, when looking at a word like 'unperceptiveness', which is a positive word prefixed to convert into a negative one, or an antonym. It's lemmas/senses might contain more positive senses of 'perceptiveness' than negative ones, and has hence been classified as a positive word, which might be wrong."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For the __second lexicon__, we will use the word2vec (CBOW) vectors included in NLTK. \n",
    "\n",
    "Using a small set of positive and negative seed terms, we will calculate the cosine similarity between vectors of seed terms and another word. We can use Gensim to iterate over words in model.vocab for comparison over seed terms. \n",
    "\n",
    "After calculating the cosine similarity of a word with both the positive and negative terms, we'll calculate their average, after flipping the sign for negative seeds. A threshold of ±0.03 will be used to determine if words are positive or negative. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 40,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Positive words: ['quality', 'complimenting', 'will', 'newel', 'coolest']\n",
      "Negative Words: ['wards', 'stoned', 'straggled', 'deplores', 'prone']\n"
     ]
    }
   ],
   "source": [
    "import gensim\n",
    "from nltk.data import find\n",
    "import random\n",
    "\n",
    "positive_seeds = [\"good\",\"nice\",\"excellent\",\"positive\",\"fortunate\",\"correct\",\"superior\",\"great\"]\n",
    "negative_seeds = [\"bad\",\"nasty\",\"poor\",\"negative\",\"unfortunate\",\"wrong\",\"inferior\",\"awful\"]\n",
    "\n",
    "word2vec_sample = str(find('models/word2vec_sample/pruned.word2vec.txt'))\n",
    "model = gensim.models.KeyedVectors.load_word2vec_format(word2vec_sample,binary=False)\n",
    "\n",
    "wv_positive = []\n",
    "wv_negative = []\n",
    "\n",
    "for word in model.index_to_key:\n",
    "    try:\n",
    "        word=word.lower()\n",
    "    \n",
    "        pos_score = 0.0\n",
    "        neg_score = 0.0\n",
    "    \n",
    "        for seed in positive_seeds:\n",
    "            pos_score = pos_score + model.similarity(word,seed)\n",
    "    \n",
    "        for seed in negative_seeds:\n",
    "            neg_score = neg_score + model.similarity(word,seed)\n",
    "        \n",
    "        avg = (pos_score - neg_score)/16 #Total number of seeds is 16\n",
    "    \n",
    "        if avg>0.03:\n",
    "            wv_positive.append(word)\n",
    "        elif avg<-0.03:\n",
    "            wv_negative.append(word)\n",
    "    except:\n",
    "        pass\n",
    "    \n",
    "\n",
    "print('Positive words: ' + str(random.sample(wv_positive,5)))\n",
    "\n",
    "print('Negative Words: ' + str(random.sample(wv_negative,5)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Again, the code randomises the printed positive and negative words. Tn my first instance, they were:\n",
    "\n",
    "Positive words: [u'elegant', u'demonstrated', u'retained', u'titles', u'strengthen']\n",
    "Negative Words: [u'scathingly', u'anorexia', u'rioted', u'blunders', u'alters']\n",
    "\n",
    "This looks like a great set of both positive negative words, looking at the samples. But let's see how it compares with NLTK's manually annotated set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Hu and Liu lexicon included with NLTK, has a list of positive and negative words.\n",
    "\n",
    "First, we'll investigate what percentage of the words in the manual lexicon are in each of the automatic lexicons, and then, only for those words which overlap and which are not in the seed set, evaluate the accuracy of with each of the automatic lexicons."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 43,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "% of words in manual lexicons, also present in the automatic lexicon\n",
      "First automatic lexicon: 13.610251878038001\n",
      "Second automatic lexicon: 37.796435410222415\n",
      "\n",
      "Accuracy of lexicons: \n",
      "First automatic lexicon: 84.46389496717724\n",
      "Second automatic lexicon: 98.94159153273226\n"
     ]
    }
   ],
   "source": [
    "from nltk.corpus import opinion_lexicon\n",
    "import math\n",
    "\n",
    "positive_words = opinion_lexicon.positive()\n",
    "negative_words = opinion_lexicon.negative()\n",
    "\n",
    "#Calculate the percentage of words in the manually annotated lexicon set, that also appear in an automatic lexicon.\n",
    "def get_perc_manual(manual_pos,manual_neg,auto_pos,auto_neg):\n",
    "    return len(set(manual_pos+manual_neg).intersection(set(auto_pos+auto_neg)))/len(manual_pos+manual_neg)*100\n",
    "\n",
    "print(\"% of words in manual lexicons, also present in the automatic lexicon\")\n",
    "print(\"First automatic lexicon: \"+ str(get_perc_manual(positive_words,negative_words,swn_positive,swn_negative)))\n",
    "print(\"Second automatic lexicon: \"+ str(get_perc_manual(positive_words,negative_words,wv_positive,wv_negative)))\n",
    "\n",
    "#Calculate the accuracy of words in the automatic lexicon. Assuming that the manual lexicons are accurate, it calculates the percentage of words that occur in both positive and negative (respectively) lists of automatic and manual lexicons.\n",
    "def get_lexicon_accuracy(manual_pos,manual_neg,auto_pos,auto_neg):\n",
    "    common_words = set(manual_pos+manual_neg).intersection(set(auto_pos+auto_neg))-set(negative_seeds)-set(positive_seeds)\n",
    "    return (len(set(manual_pos) & set(auto_pos) & common_words)+len(set(manual_neg) & set(auto_neg) & common_words))/len(common_words)*100\n",
    "\n",
    "print(\"\\nAccuracy of lexicons: \")\n",
    "print(\"First automatic lexicon: \"+ str(get_lexicon_accuracy(positive_words,negative_words,swn_positive,swn_negative)))\n",
    "print(\"Second automatic lexicon: \"+ str(get_lexicon_accuracy(positive_words,negative_words,wv_positive,wv_negative)))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The second lexicon shares the most common words with the manual lexicon, and has the most accurately classified words, as it uses the most intutive way of creative positive/negative lexicons i.e. by identifying the most similar words."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Lexicons for Classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "What if we used the lexicons for the main classification problem? \n",
    "\n",
    "Let's create a function that calculates a polarity score for a sentence based on a given lexicon. We'll count the positive and negative words that appear in the tweet, and then return a +1 if there are more posiitve words, a -1 if there are more negative words, and a 0 otherwise.\n",
    "\n",
    "We'll then compare the results of the three lexicons on the development set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 45,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Manual lexicon accuracy: 45.2159650082012\n",
      "First auto lexicon accuracy: 42.3728813559322\n",
      "Second auto lexicon accuracy: 45.16129032258064\n"
     ]
    }
   ],
   "source": [
    "#All lexicons are converted to sets for faster preprocessing.\n",
    "manual_pos_set = set(positive_words)\n",
    "manual_neg_set = set(negative_words)\n",
    "\n",
    "syn_pos_set = set(swn_positive)\n",
    "syn_neg_set = set(swn_negative)\n",
    "\n",
    "wordvec_pos_set = set(wv_positive)\n",
    "wordvec_neg_set = set(wv_negative)\n",
    "\n",
    "#Function to calculate the polarity score of a sentence based on the frequency of positive or negative words. \n",
    "def get_polarity_score(sentence,pos_lexicon,neg_lexicon):\n",
    "    pos_count = 0\n",
    "    neg_count = 0\n",
    "    for word in sentence:\n",
    "        if word in pos_lexicon:\n",
    "            pos_count+=1\n",
    "        if word in neg_lexicon:\n",
    "            neg_count+=1\n",
    "    if pos_count>neg_count:\n",
    "        return 1\n",
    "    elif neg_count>pos_count:\n",
    "        return -1\n",
    "    else:\n",
    "        return 0\n",
    "    \n",
    "\n",
    "#Function to calculate the score for each tweet, and compare it against the actual labels of the dataset and calculate/count the accuracy score. \n",
    "def data_polarity_accuracy(dataset,datalabels,pos_lexicon,neg_lexicon):\n",
    "    accuracy_count = 0\n",
    "    for index,tweet in enumerate(dataset):\n",
    "        if datalabels[index]==get_polarity_score([word for sentence in tweet for word in sentence],pos_lexicon,neg_lexicon):\n",
    "            accuracy_count+=1\n",
    "    return (accuracy_count/len(dataset))*100\n",
    "        \n",
    "print(\"Manual lexicon accuracy: \"+str(data_polarity_accuracy(dev_data[0],dev_data[1],manual_pos_set,manual_neg_set))      )\n",
    "print(\"First auto lexicon accuracy: \"+str(data_polarity_accuracy(dev_data[0],dev_data[1],syn_pos_set,syn_neg_set))      )\n",
    "print(\"Second auto lexicon accuracy: \"+str(data_polarity_accuracy(dev_data[0],dev_data[1],wordvec_pos_set,wordvec_neg_set)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As we can see, the results reflect the quality metric obtained from the previous section, with the manual and second lexicon (word vector) winning out, while still not being as good as a Machine Learning algorithm without the polarity information."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Polarity Lexicon with Machine Learning\n",
    "\n",
    "To conclude, we'll investigate the effects of adding the polarity score as a feature for our statistical classifier. \n",
    "\n",
    "We'll create a new version of our feature extraction function, to integrate the extra _feature_ and retrain our logisitc regression classifier to see if there's an improvement."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "def convert_to_feature_dicts_v2(tweets,manual,first,second,remove_stop_words,n): \n",
    "    feature_dicts = []\n",
    "    for tweet in tweets:\n",
    "        # build feature dictionary for tweet\n",
    "        feature_dict = {}\n",
    "        if remove_stop_words:\n",
    "            for segment in tweet:\n",
    "                for token in segment:\n",
    "                    if token not in stopwords and (n<=0 or total_train_bow[token]>=n):\n",
    "                        feature_dict[token] = feature_dict.get(token,0) + 1\n",
    "        else:\n",
    "            for segment in tweet:\n",
    "                for token in segment:\n",
    "                    if n<=0 or total_train_bow[token]>=n:\n",
    "                        feature_dict[token] = feature_dict.get(token,0) + 1\n",
    "        if manual == True:\n",
    "            feature_dict['manual_polarity'] = get_polarity_score([word for sentence in tweet for word in sentence],manual_pos_set,manual_neg_set)\n",
    "        if first == True:\n",
    "            feature_dict['synset_polarity'] = get_polarity_score([word for sentence in tweet for word in sentence],syn_pos_set,syn_neg_set)\n",
    "        if second == True:\n",
    "            feature_dict['wordvec_polarity'] = get_polarity_score([word for sentence in tweet for word in sentence],wordvec_pos_set,wordvec_neg_set)\n",
    "    \n",
    "        feature_dicts.append(feature_dict)      \n",
    "    return feature_dicts"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "training_set_v2 = convert_to_feature_dicts_v2(train_tweets,True,False,True,True,2)\n",
    "\n",
    "training_data_v2 = vectorizer.fit_transform(training_set_v2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Logistic Regression V2 (with polarity scores) Accuracy: 0.5079278294149808\n"
     ]
    }
   ],
   "source": [
    "dev_set_v2 = convert_to_feature_dicts_v2(dev_data[0],True,False,True,False,0)\n",
    "\n",
    "development_data_v2 = vectorizer.transform(dev_set_v2)\n",
    "\n",
    "log_clf_v2 = LogisticRegression(C=0.012,solver='lbfgs',multi_class='multinomial')\n",
    "\n",
    "log_clf_v2.fit(training_data_v2,train_data[1])\n",
    "\n",
    "log_predictions_v2 = log_clf_v2.predict(development_data_v2)\n",
    "\n",
    "print(\"Logistic Regression V2 (with polarity scores) Accuracy: \" + str(accuracy_score(dev_data[1],log_predictions_v2)))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Though minimal, there was some improvement indeed in the classifier by integrating the polarity data. \n",
    "\n",
    "This concludes our project of building a very basic 3-way polarity classifier for tweets."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
